Abstract

This paper introduces the Deep Recurrent Attentive Writer
(DRAW) neural network architecture for image generation.
DRAW networks combine a novel spatial attention mechanism
that mimics the foveation of the human eye, with a sequential
variational auto-encoding framework that allows for the
iterative construction of complex images. The system
substantially improves on the state of the art for generative
models on MNIST, and, when trained on the Street View House
Numbers dataset, it generates images that cannot be
distinguished from real data with the naked eye.


1. Introduction 

A person asked to draw, paint or otherwise recreate a visual
scene will naturally do so in a sequential, iterative fashion,
reassessing their handiwork after each modification. Rough
outlines are gradually replaced by precise forms, lines are
sharpened, darkened or erased, shapes are altered, and the
final picture emerges. Most approaches to automatic image
generation, however, aim to generate entire scenes at once.
In the context of generative neural networks, this typically
means that all the pixels are conditioned on a single latent
distribution (Dayan et al., 1995; Hinton & Salakhutdinov,
2006; Larochelle & Murray, 2011). As well as precluding the
possibility of iterative self-correction, the “one shot”
approach is fundamentally difficult to scale to large images.
The Deep Recurrent Attentive Writer (DRAW) architecture
represents a shift towards a more natural form of image
construction, in which parts of a scene are created
independently from others, and approximate sketches are
successively refined.

Figure 1. A trained DRAW network generating MNIST digits.
Each row shows successive stages in the generation of a single
digit. Note how the lines composing the digits appear to be
“drawn” by the network. The red rectangle delimits the area
attended to by the network at each time-step, with the focal
precision indicated by the width of the rectangle border.


The core of the DRAW architecture is a pair of recurrent
neural networks: an encoder network that compresses the
real images presented during training, and a decoder that
reconstitutes images after receiving codes. The combined
system is trained end-to-end with stochastic gradient
descent, where the loss function is a variational upper bound
on the negative log-likelihood of the data. It therefore
belongs to the family of variational auto-encoders, a recently
emerged hybrid of deep learning and variational inference that
has led to significant advances in generative modelling
(Gregor et al., 2014; Kingma & Welling, 2014; Rezende et al.,
2014; Mnih & Gregor, 2014; Salimans et al., 2014). Where
DRAW differs from its siblings is that, rather than generating
images in a single pass, it iteratively constructs scenes
through an accumulation of modifications emitted by the
decoder, each of which is observed by the encoder.

An obvious correlate of generating images step by step is
the ability to selectively attend to parts of the scene while
ignoring others. A wealth of results in the past few years
suggest that visual structure can be better captured by a
sequence of partial glimpses, or foveations, than by a single
sweep through the entire image (Larochelle & Hinton, 2010;
Denil et al., 2012; Tang et al., 2013; Ranzato, 2014;
Zheng et al., 2014; Mnih et al., 2014; Ba et al., 2014;
Sermanet et al., 2014). The main challenge faced by sequential
attention models is learning where to look, which can be
addressed with reinforcement learning techniques such as
policy gradients (Mnih et al., 2014). The attention model in
DRAW, however, is fully differentiable, making it possible
to train with standard backpropagation. In this sense it
resembles the selective read and write operations developed
for the Neural Turing Machine (Graves et al., 2014).

The following section defines the DRAW architecture,
along with the loss function used for training and the
procedure for image generation. Section 3 presents the
selective attention model and shows how it is applied to
reading and modifying images. Section 4 provides experimental
results on the MNIST, Street View House Numbers and CIFAR-10
datasets, with examples of generated images; and concluding
remarks are given in Section 5. Lastly, we would like to
direct the reader to the video accompanying this paper
(https://www.youtube.com/watch?v=Zt-7MI9eKEo) which contains
examples of DRAW networks reading and generating images.


2. The DRAW Network 

The basic structure of a DRAW network is similar to that of
other variational auto-encoders: an encoder network determines
a distribution over latent codes that capture salient
information about the input data; a decoder network receives
samples from the code distribution and uses them to condition
its own distribution over images. However there are three key
differences. Firstly, both the encoder and decoder are
recurrent networks in DRAW, so that a sequence of code samples
is exchanged between them; moreover the encoder is privy to
the decoder’s previous outputs, allowing it to tailor the
codes it sends according to the decoder’s behaviour so far.
Secondly, the decoder’s outputs are successively added to the
distribution that will ultimately generate the data, as
opposed to emitting this distribution in a single step. And
thirdly, a dynamically updated attention mechanism is used to
restrict both the input region observed by the encoder, and
the output region modified by the decoder. In simple terms,
the network decides at each time-step “where to read” and
“where to write” as well as “what to write”. The architecture
is sketched in Fig. 2, alongside a feedforward variational
auto-encoder.

Figure 2. Left: Conventional Variational Auto-Encoder.
During generation, a sample $z$ is drawn from a prior $P(z)$
and passed through the feedforward decoder network to compute
the probability of the input $P(x|z)$ given the sample. During
inference the input $x$ is passed to the encoder network,
producing an approximate posterior $Q(z|x)$ over latent
variables. During training, $z$ is sampled from $Q(z|x)$ and
then used to compute the total description length
$KL(Q(Z|x) \| P(Z)) - \log P(x|z)$, which is minimised with
stochastic gradient descent. Right: DRAW Network. At each
time-step a sample $z_t$ from the prior $P(z_t)$ is passed to
the recurrent decoder network, which then modifies part of the
canvas matrix. The final canvas matrix $c_T$ is used to
compute $P(x|z_{1:T})$. During inference the input is read at
every time-step and the result is passed to the encoder RNN.
The RNNs at the previous time-step specify where to read. The
output of the encoder RNN is used to compute the approximate
posterior over the latent variables at that time-step.

2.1. Network Architecture 

Let $RNN^{enc}$ be the function enacted by the encoder network
at a single time-step. The output of $RNN^{enc}$ at time $t$
is the encoder hidden vector $h_t^{enc}$. Similarly the output
of the decoder $RNN^{dec}$ at $t$ is the hidden vector
$h_t^{dec}$. In general the encoder and decoder may be
implemented by any recurrent neural network. In our
experiments we use the Long Short-Term Memory architecture
(LSTM; Hochreiter & Schmidhuber (1997)) for both, in the
extended form with forget gates (Gers et al., 2000). We favour
LSTM due to its proven track record for handling long-range
dependencies in real sequential data (Graves, 2013; Sutskever
et al., 2014). Throughout the paper, we use the notation
$b = W(a)$ to denote a linear weight matrix with bias from
the vector $a$ to the vector $b$.

At each time-step $t$, the encoder receives input from both
the image $x$ and from the previous decoder hidden vector
$h_{t-1}^{dec}$. The precise form of the encoder input depends
on a read operation, which will be defined in the next
section. The output $h_t^{enc}$ of the encoder is used to
parameterise a distribution $Q(Z_t|h_t^{enc})$ over the latent
vector $z_t$. In our experiments the latent distribution is a
diagonal Gaussian $\mathcal{N}(Z_t|\mu_t, \sigma_t)$:

$$\mu_t = W(h_t^{enc}) \tag{1}$$

$$\sigma_t = \exp\left(W(h_t^{enc})\right) \tag{2}$$

Bernoulli distributions are more common than Gaussians
for latent variables in auto-encoders (Dayan et al., 1995;
Gregor et al., 2014); however a great advantage of Gaussian
latents is that the gradient of a function of the samples
with respect to the distribution parameters can be easily
obtained using the so-called reparameterization trick
(Kingma & Welling, 2014; Rezende et al., 2014). This
makes it straightforward to back-propagate unbiased, low
variance stochastic gradients of the loss function through
the latent distribution.
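
To make the reparameterization trick concrete, the sketch below
writes a Gaussian sample as a deterministic function of the
distribution parameters and an independent noise draw. It is an
illustration only, not the authors' code: the function names
`sample_latent` and `dz_dparams` are hypothetical, and `mu` and
`log_sigma` stand for the two linear outputs that appear in
Eqs. 1 and 2.

```python
import math
import random


def sample_latent(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    All randomness lives in eps, so for a fixed eps the sample z is a
    differentiable function of (mu, log_sigma).
    """
    sigma = [math.exp(s) for s in log_sigma]       # Eq. 2: sigma = exp(W(h_enc))
    eps = [rng.gauss(0.0, 1.0) for _ in mu]        # noise independent of the parameters
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps


def dz_dparams(log_sigma, eps):
    """Pathwise derivatives: dz/dmu = 1 and dz/dlog_sigma = sigma * eps."""
    d_mu = [1.0] * len(eps)
    d_log_sigma = [math.exp(s) * e for s, e in zip(log_sigma, eps)]
    return d_mu, d_log_sigma
```

Because the derivatives above exist for every noise draw, the
gradient of any downstream loss with respect to `mu` and
`log_sigma` follows from the chain rule, which is what allows
ordinary backpropagation through the sampling step.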


negative log probability of $x$ under $D$:

$$\mathcal{L}^x = -\log D(x|c_T) \tag{9}$$

The latent loss $\mathcal{L}^z$ for a sequence of latent
distributions $Q(Z_t|h_t^{enc})$ is defined as the summed
Kullback-Leibler divergence of some latent prior $P(Z_t)$ from
$Q(Z_t|h_t^{enc})$:

$$\mathcal{L}^z = \sum_{t=1}^{T} KL\left(Q(Z_t|h_t^{enc}) \,\|\, P(Z_t)\right) \tag{10}$$

Note that this loss depends upon the latent samples $z_t$
drawn from $Q(Z_t|h_t^{enc})$, which depend in turn on the
input $x$. If the latent distribution is a diagonal Gaussian
with $\mu_t$, $\sigma_t$ as defined in Eqs. 1 and 2, a simple
choice for $P(Z_t)$ is a standard Gaussian with mean zero and
standard deviation one, in which case Eq. 10 becomes

$$\mathcal{L}^z = \frac{1}{2}\left(\sum_{t=1}^{T} \mu_t^2 + \sigma_t^2 - \log \sigma_t^2\right) - T/2 \tag{11}$$
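
For a standard Gaussian prior, each term of Eq. 10 has the
familiar closed form $\frac{1}{2}(\mu^2 + \sigma^2 - \log \sigma^2 - 1)$
per latent dimension. The sketch below evaluates that sum; it
is a minimal illustration that assumes the latent means and
standard deviations are supplied as one list per time-step, and
the name `latent_loss` is hypothetical.

```python
import math


def latent_loss(mus, sigmas):
    """Sum of KL(N(mu, sigma^2) || N(0, 1)) over time-steps and latent dimensions.

    mus, sigmas: lists with one entry per time-step, each entry a list of
    per-dimension means or standard deviations (sigma > 0).
    """
    total = 0.0
    for mu_t, sigma_t in zip(mus, sigmas):
        for m, s in zip(mu_t, sigma_t):
            total += 0.5 * (m * m + s * s - math.log(s * s) - 1.0)
    return total
```

The loss is zero exactly when every posterior equals the prior
(all means zero, all standard deviations one) and positive
otherwise.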


At each time-step a sample $z_t \sim Q(Z_t|h_t^{enc})$ drawn
from the latent distribution is passed as input to the
decoder. The output $h_t^{dec}$ of the decoder is added (via a
write operation, defined in the sequel) to a cumulative canvas
matrix $c_t$, which is ultimately used to reconstruct the
image. The total number of time-steps $T$ consumed by the
network before performing the reconstruction is a free
parameter that must be specified in advance.

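For each image $x$ presented to the network, $c_0$,
$h_0^{enc}$, $h_0^{dec}$ are initialised to learned biases,
and the DRAW network iteratively computes the following
equations for $t = 1, \dots, T$:

$$\hat{x}_t = x - \sigma(c_{t-1}) \tag{3}$$

$$r_t = read(x, \hat{x}_t, h_{t-1}^{dec}) \tag{4}$$

$$h_t^{enc} = RNN^{enc}(h_{t-1}^{enc}, [r_t, h_{t-1}^{dec}]) \tag{5}$$

$$z_t \sim Q(Z_t|h_t^{enc}) \tag{6}$$

$$h_t^{dec} = RNN^{dec}(h_{t-1}^{dec}, z_t) \tag{7}$$

$$c_t = c_{t-1} + write(h_t^{dec}) \tag{8}$$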


where $\hat{x}_t$ is the error image, $[v, w]$ is the
concatenation of vectors $v$ and $w$ into a single vector, and
$\sigma$ denotes the logistic sigmoid function:
$\sigma(x) = \frac{1}{1 + \exp(-x)}$. Note that $h_t^{enc}$,
and hence $Q(Z_t|h_t^{enc})$, depends on both $x$ and the
history $z_{1:t-1}$ of previous latent samples. We will
sometimes make this dependency explicit by writing
$Q(Z_t|x, z_{1:t-1})$, as shown in Fig. 2. $h_{t-1}^{enc}$ can
also be passed as input to the read operation; however we did
not find that this helped performance and therefore omitted it.
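
The recurrence in Eqs. 3 to 8 can be sketched in a few lines.
The version below is a minimal illustration and not the
authors' implementation: it assumes a plain tanh RNN in place
of the LSTM, zero initial states in place of learned biases,
untrained random weights, a flattened image vector, and the
attention-free read and write of Section 3.1. The helper names
(`linear`, `init`, `sigmoid`, `draw_forward`) are hypothetical.

```python
import math
import random


def linear(W, b, v):
    """Affine map W v + b, i.e. the W(.) notation used in the text."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def init(rows, cols, rng):
    """Small random weights and zero biases (untrained, for illustration only)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def draw_forward(x, T, n_h, n_z, rng):
    """Run Eqs. 3-8 for T steps without attention and return the final canvas."""
    n_x = len(x)
    # Assumption: a plain tanh RNN stands in for the LSTM used in the paper.
    enc = init(n_h, 2 * n_h + 2 * n_x, rng)    # input: [h_enc_prev, r_t, h_dec_prev]
    dec = init(n_h, n_h + n_z, rng)            # input: [h_dec_prev, z_t]
    to_mu, to_log_sigma = init(n_z, n_h, rng), init(n_z, n_h, rng)
    to_canvas = init(n_x, n_h, rng)
    # Assumption: zero initial states instead of learned biases.
    c, h_enc, h_dec = [0.0] * n_x, [0.0] * n_h, [0.0] * n_h
    for _ in range(T):
        x_hat = [xi - si for xi, si in zip(x, sigmoid(c))]                  # Eq. 3
        r = x + x_hat                                                       # Eq. 4 (list concatenation)
        h_enc = [math.tanh(a) for a in linear(*enc, h_enc + r + h_dec)]     # Eq. 5
        mu = linear(*to_mu, h_enc)                                          # Eq. 1
        sigma = [math.exp(a) for a in linear(*to_log_sigma, h_enc)]         # Eq. 2
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]        # Eq. 6
        h_dec = [math.tanh(a) for a in linear(*dec, h_dec + z)]             # Eq. 7
        c = [ci + wi for ci, wi in zip(c, linear(*to_canvas, h_dec))]       # Eq. 8
    return c
```

Applying `sigmoid` to the returned canvas gives the per-pixel
Bernoulli means used for reconstruction. The canvas is only ever
modified additively, which is the property that lets later steps
refine earlier ones.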


The total loss $\mathcal{L}$ for the network is the
expectation of the sum of the reconstruction and latent
losses:

$$\mathcal{L} = \left\langle \mathcal{L}^x + \mathcal{L}^z \right\rangle_{z \sim Q} \tag{12}$$

which we optimise using a single sample of $z$ for each
stochastic gradient descent step.

$\mathcal{L}^z$ can be interpreted as the number of nats
required to transmit the latent sample sequence $z_{1:T}$ to
the decoder from the prior, and (if $x$ is discrete)
$\mathcal{L}^x$ is the number of nats required for the decoder
to reconstruct $x$ given $z_{1:T}$. The total loss is
therefore equivalent to the expected compression of the data
by the decoder and prior.
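
As an illustration of how the two terms combine, the sketch
below computes the Bernoulli reconstruction loss of Eq. 9 in
nats from a final canvas and adds a latent loss supplied by the
caller, giving the single-sample estimate of Eq. 12. The names
`reconstruction_loss` and `total_loss` are hypothetical, and the
softplus rewriting is a standard numerical-stability choice
rather than something stated in the text.

```python
import math


def reconstruction_loss(x, c_T):
    """-log D(x | c_T) in nats for binary x with Bernoulli means sigmoid(c_T)."""
    loss = 0.0
    for xi, ci in zip(x, c_T):
        tail = math.log1p(math.exp(-abs(ci)))
        nll_one = max(-ci, 0.0) + tail    # -log sigmoid(c), cost of emitting a 1
        nll_zero = max(ci, 0.0) + tail    # -log(1 - sigmoid(c)), cost of emitting a 0
        loss += xi * nll_one + (1.0 - xi) * nll_zero
    return loss


def total_loss(x, c_T, latent_loss_value):
    """Single-sample estimate of Eq. 12: reconstruction nats plus latent nats."""
    return reconstruction_loss(x, c_T) + latent_loss_value
```

With an all-zero canvas every pixel costs $\log 2$ nats, the
price of an uninformed binary guess, so any useful decoder must
bring the reconstruction term below that baseline by more than
it spends on the latent term.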

2.3. Stochastic Data Generation 

An image $x$ can be generated by a DRAW network by
iteratively picking latent samples $z_t$ from the prior $P$,
then running the decoder to update the canvas matrix $c_t$.
After $T$ repetitions of this process the generated image is a
sample from $D(X|c_T)$:

$$z_t \sim P(Z_t) \tag{13}$$

$$h_t^{dec} = RNN^{dec}(h_{t-1}^{dec}, z_t) \tag{14}$$

$$c_t = c_{t-1} + write(h_t^{dec}) \tag{15}$$

$$x \sim D(X|c_T) \tag{16}$$

Note that the encoder is not involved in image generation.
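
The generation procedure of Eqs. 13 to 16 can be sketched as a
short loop. This is an illustration under stated assumptions,
not the authors' code: the prior is taken to be a standard
Gaussian, $D$ is taken to be Bernoulli, the canvas starts at
zero, and the decoder step and write operation are passed in as
hypothetical callables (`decoder_step`, `write`) so that the
sketch does not commit to a particular network.

```python
import math
import random


def generate(decoder_step, write, h_dec0, n_x, n_z, T, rng):
    """Sample an image: draw z_t from N(0, I), update decoder and canvas, sample x."""
    c, h_dec = [0.0] * n_x, list(h_dec0)
    for _ in range(T):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_z)]        # Eq. 13
        h_dec = decoder_step(h_dec, z)                       # Eq. 14
        c = [ci + wi for ci, wi in zip(c, write(h_dec))]     # Eq. 15
    means = [1.0 / (1.0 + math.exp(-ci)) for ci in c]
    x = [1 if rng.random() < p else 0 for p in means]        # Eq. 16, Bernoulli D
    return x, means


# Toy stand-ins for a trained decoder and write operation (hypothetical).
toy_step = lambda h, z: [math.tanh(hi + zi) for hi, zi in zip(h, z)]
toy_write = lambda h: [2.0 * hi for hi in h]
x, means = generate(toy_step, toy_write, [0.0, 0.0],
                    n_x=2, n_z=2, T=4, rng=random.Random(1))
```

The function returns both a binary sample and the mean
probabilities, since generated images are often displayed as
means rather than samples.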


2.2. Loss Function 

The final canvas matrix $c_T$ is used to parameterise a model
$D(X|c_T)$ of the input data. If the input is binary, the
natural choice for $D$ is a Bernoulli distribution with means
given by $\sigma(c_T)$. The reconstruction loss $\mathcal{L}^x$
is defined as the


3. Read and Write Operations 

The DRAW network described in the previous section is 
not complete until the read and write operations in Eqs. 4 
and 8 have been defined. This section describes two ways 
to do so, one with selective attention and one without. 

3.1. Reading and Writing Without Attention

In the simplest instantiation of DRAW the entire input image
is passed to the encoder at every time-step, and the decoder
modifies the entire canvas matrix at every time-step. In this
case the read and write operations reduce to

$$read(x, \hat{x}_t, h_{t-1}^{dec}) = [x, \hat{x}_t] \tag{17}$$

$$write(h_t^{dec}) = W(h_t^{dec}) \tag{18}$$

However this approach does not allow the encoder to focus on
only part of the input when creating the latent distribution;
nor does it allow the decoder to modify only a part of the
canvas vector. In other words it does not provide the network
with an explicit selective attention mechanism, which we
believe to be crucial to large scale image generation. We
refer to the above configuration as “DRAW without attention”.

3.2. Selective Attention Model 

To endow the network with selective attention without
sacrificing the benefits of gradient descent training, we take
inspiration from the differentiable attention mechanisms
recently used in handwriting synthesis (Graves, 2013) and
Neural Turing Machines (Graves et al., 2014). Unlike the
aforementioned works, we consider an explicitly
two-dimensional form of attention, where an array of 2D
Gaussian filters is applied to the image, yielding an image
‘patch’ of smoothly varying location and zoom. This
configuration, which we refer to simply as “DRAW”, somewhat
resembles the affine transformations used in computer
graphics-based autoencoders (Tieleman, 2014).

As illustrated in Fig. 3, the $N \times N$ grid of Gaussian
filters is positioned on the image by specifying the
co-ordinates of the grid centre and the stride distance
between adjacent filters. The stride controls the ‘zoom’ of
the patch; that is, the larger the stride, the larger an area
of the original image will be visible in the attention patch,
but the lower the effective resolution of the patch will be.
The grid centre $(g_X, g_Y)$ and stride $\delta$ (both of
which are real-valued) determine the mean location
$\mu_X^i$, $\mu_Y^j$ of the filter at row $i$, column $j$ in
the patch as follows:

$$\mu_X^i = g_X + (i - N/2 - 0.5)\,\delta \tag{19}$$

$$\mu_Y^j = g_Y + (j - N/2 - 0.5)\,\delta \tag{20}$$

Figure 3. Left: A 3 × 3 grid of filters superimposed on an
image. The stride ($\delta$) and centre location
$(g_X, g_Y)$ are indicated. Right: Three $N \times N$ patches
extracted from the image ($N = 12$). The green rectangles on
the left indicate the boundary and precision ($\sigma$) of the
patches, while the patches themselves are shown to the right.
The top patch has a small $\delta$ and high $\sigma$, giving a
zoomed-in but blurry view of the centre of the digit; the
middle patch has large $\delta$ and low $\sigma$, effectively
downsampling the whole image; and the bottom patch has high
$\delta$ and $\sigma$.

Two more parameters are required to fully specify the
attention model: the isotropic variance $\sigma^2$ of the
Gaussian filters, and a scalar intensity $\gamma$ that
multiplies the filter response. Given an $A \times B$ input
image $x$, all five attention parameters are dynamically
determined at each time step via a linear transformation of
the decoder output $h^{dec}$:

$$(\tilde{g}_X, \tilde{g}_Y, \log \sigma^2, \log \tilde{\delta}, \log \gamma) = W(h^{dec}) \tag{21}$$

$$g_X = \frac{A + 1}{2}(\tilde{g}_X + 1) \tag{22}$$

$$g_Y = \frac{B + 1}{2}(\tilde{g}_Y + 1) \tag{23}$$

$$\delta = \frac{\max(A, B) - 1}{N - 1}\,\tilde{\delta} \tag{24}$$

where the variance, stride and intensity are emitted in the
log-scale to ensure positivity. The scaling of $g_X$, $g_Y$
and $\delta$ is chosen to ensure that the initial patch (with
a randomly initialised network) roughly covers the whole input
image.

Given the attention parameters emitted by the decoder, the
horizontal and vertical filterbank matrices $F_X$ and $F_Y$
(dimensions $N \times A$ and $N \times B$ respectively) are
defined as follows:

$$F_X[i, a] = \frac{1}{Z_X} \exp\left(-\frac{(a - \mu_X^i)^2}{2\sigma^2}\right) \tag{25}$$

$$F_Y[j, b] = \frac{1}{Z_Y} \exp\left(-\frac{(b - \mu_Y^j)^2}{2\sigma^2}\right) \tag{26}$$

where $(i, j)$ is a point in the attention patch, $(a, b)$ is
a point in the input image, and $Z_X$, $Z_Y$ are normalisation
constants that ensure that $\sum_a F_X[i, a] = 1$ and
$\sum_b F_Y[j, b] = 1$.
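
The construction of the filterbanks can be made concrete with a
short sketch. It is an illustration rather than the authors'
code, and it rests on several assumptions: filter and pixel
indices both start at 1, the stride scaling in Eq. 24 is
$(\max(A, B) - 1)/(N - 1)$, and an all-zero filter row is left
unnormalised instead of being divided by zero. The names
`attention_params` and `filterbank` are hypothetical.

```python
import math


def attention_params(g_tilde_x, g_tilde_y, log_sigma2, log_delta_tilde, log_gamma, A, B, N):
    """Map the five raw linear outputs (Eq. 21) to image co-ordinates (Eqs. 22-24)."""
    g_x = (A + 1) / 2 * (g_tilde_x + 1)
    g_y = (B + 1) / 2 * (g_tilde_y + 1)
    delta = (max(A, B) - 1) / (N - 1) * math.exp(log_delta_tilde)
    return g_x, g_y, delta, math.exp(log_sigma2), math.exp(log_gamma)


def filterbank(centre, stride, sigma2, N, size):
    """Return N rows of Gaussian filters over `size` pixel positions.

    Row i (i = 1..N) is centred at centre + (i - N/2 - 0.5) * stride, as in
    Eqs. 19-20, and each row is normalised to sum to one, as in Eqs. 25-26.
    """
    F = []
    for i in range(1, N + 1):
        mu = centre + (i - N / 2 - 0.5) * stride
        row = [math.exp(-((a - mu) ** 2) / (2.0 * sigma2)) for a in range(1, size + 1)]
        Z = sum(row) or 1.0    # guard: a filter far off the image sums to zero
        F.append([v / Z for v in row])
    return F


# All-zero raw outputs on a 28 x 28 image with a 5 x 5 grid.
g_x, g_y, delta, sigma2, gamma = attention_params(0.0, 0.0, 0.0, 0.0, 0.0, A=28, B=28, N=5)
F_x = filterbank(g_x, delta, sigma2, N=5, size=28)
F_y = filterbank(g_y, delta, sigma2, N=5, size=28)
```

In this example the grid centre is 14.5 and the stride is 6.75,
so the first and last filters peak at pixels 1 and 28. This
matches the stated intent that the initial patch roughly covers
the whole image.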

Figure 4. Zooming. Top Left: The original 100 × 75 image. Top
Middle: A 12 × 12 patch extracted with 144 2D Gaussian filters.
Top Right: The reconstructed image when applying transposed
filters on the patch. Bottom: Only two 2D Gaussian filters are
displayed. The first one is used to produce the top-left patch
feature. The last filter is used to produce the bottom-right
patch feature. By using different filter weights, the
attention can be moved to a different location.


3.3. Reading and Writing With Attention 

Given $F_X$, $F_Y$ and intensity $\gamma$ determined by
$h_{t-1}^{dec}$, along with an input image $x$ and error image
$\hat{x}_t$, the read operation returns the concatenation of
two $N \times N$ patches from the image and error image:

$$read(x, \hat{x}_t, h_{t-1}^{dec}) = \gamma\,[F_Y x F_X^T, F_Y \hat{x} F_X^T] \tag{27}$$

Note that the same filterbanks are used for both the image
and error image. For the write operation, a distinct set of
attention parameters $\hat{\gamma}$, $\hat{F}_X$ and
$\hat{F}_Y$ are extracted from $h_t^{dec}$, the order of
transposition is reversed, and the intensity is inverted:

$$w_t = W(h_t^{dec}) \tag{28}$$

$$write(h_t^{dec}) = \frac{1}{\hat{\gamma}} \hat{F}_Y^T w_t \hat{F}_X \tag{29}$$
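
Eqs. 27 and 29 amount to two pairs of matrix products. The
sketch below spells them out with plain nested lists; it is an
illustration only, assuming images are stored as $B \times A$
arrays (rows by columns), that the filterbanks are supplied by
the caller, and that the two read patches are returned as a pair
rather than concatenated. The helper names (`matmul`,
`transpose`, `read_attn`, `write_attn`) are hypothetical.

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(M):
    return [list(col) for col in zip(*M)]


def read_attn(x, x_hat, F_x, F_y, gamma):
    """Eq. 27: gamma * F_y img F_x^T for the image and the error image."""
    def patch(img):
        out = matmul(matmul(F_y, img), transpose(F_x))    # (N x B)(B x A)(A x N)
        return [[gamma * v for v in row] for row in out]
    return patch(x), patch(x_hat)


def write_attn(w, F_x, F_y, gamma):
    """Eq. 29: (1 / gamma) * F_y^T w F_x, mapping an N x N patch to a B x A update."""
    out = matmul(matmul(transpose(F_y), w), F_x)          # (B x N)(N x N)(N x A)
    return [[v / gamma for v in row] for row in out]
```

Reading reduces a full image to an $N \times N$ patch and
writing expands a patch back to image size, so the cost of both
operations is governed by the patch size rather than by a dense
layer over every pixel.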


Here $w_t$ is the $N \times N$ writing patch emitted by
$h_t^{dec}$. For colour images each point in the input and
error image (and hence in the reading and writing patches) is
an RGB triple. In this case the same reading and writing
filters are used for all three channels.

4. Experimental Results

We assess the ability of DRAW to generate realistic-looking
images by training on three datasets of progressively
increasing visual complexity: MNIST (LeCun et al., 1998),
Street View House Numbers (SVHN) (Netzer et al., 2011) and
CIFAR-10 (Krizhevsky, 2009). The images generated by the
network are always novel (not simply copies of training
examples), and are virtually indistinguishable from real data
for MNIST and SVHN; the generated CIFAR images are somewhat
blurry, but still contain recognisable structure from natural
scenes. The binarized MNIST results substantially improve on
the state of the art. As a preliminary exercise, we also
evaluate the 2D attention module of the DRAW network on
cluttered MNIST classification.

For all experiments, the model $D(X|c_T)$ of the input data
was a Bernoulli distribution with means given by
$\sigma(c_T)$. For the MNIST experiments, the reconstruction
loss from Eq. 9 was the usual binary cross-entropy term. For
the SVHN and CIFAR-10 experiments, the red, green and blue
pixel intensities were represented as numbers between 0 and 1,
which were then interpreted as independent colour emission
probabilities. The reconstruction loss was therefore the
cross-entropy between the pixel intensities and the model
probabilities. Although this approach worked well in practice,
it means that the training loss did not correspond to the true
compression cost of RGB images.

Network hyper-parameters for all the experiments are
presented in Table 3. The Adam optimisation algorithm
(Kingma & Ba, 2014) was used throughout. Examples of
generation sequences for MNIST and SVHN are provided in the
accompanying video
(https://www.youtube.com/watch?v=Zt-7MI9eKEo).

4.1. Cluttered MNIST Classification

To test the classification efficacy of the DRAW attention
mechanism (as opposed to its ability to aid in image
generation), we evaluate its performance on the 100 × 100
cluttered translated MNIST task (Mnih et al., 2014). Each
image in cluttered MNIST contains many digit-like fragments
of visual clutter that the network must distinguish from the
true digit to be classified. As illustrated in Fig. 5, having
an iterative attention model allows the network to
progressively zoom in on the relevant region of the image, and
ignore the clutter outside it.


Our model consists of an LSTM recurrent network that receives
a 12 × 12 ‘glimpse’ from the input image at each time-step,
using the selective read operation defined in Section 3.2.
After a fixed number of glimpses the network uses a softmax
layer to classify the MNIST digit. The network is similar to
the recently introduced Recurrent Attention Model (RAM)
(Mnih et al., 2014), except that our attention method is
differentiable; we therefore refer to it as “Differentiable
RAM”.

The results in Table 1 demonstrate a significant improvement
in test error over the original RAM network. Moreover our
model had only a single attention patch at each time-step,
whereas RAM used four, at different zooms.

Figure 5. Cluttered MNIST classification with attention. Each
sequence shows a succession of four glimpses taken by the
network while classifying cluttered translated MNIST. The
green rectangle indicates the size and location of the
attention patch, while the line width represents the variance
of the filters.

Table 1. Classification test error on 100 × 100 Cluttered
Translated MNIST.

Model                                      Error
Convolutional, 2 layers                    14.35%
RAM, 4 glimpses, 12 × 12, 4 scales         9.41%
RAM, 8 glimpses, 12 × 12, 4 scales         8.11%
Differentiable RAM, 4 glimpses, 12 × 12    4.18%
Differentiable RAM, 8 glimpses, 12 × 12    3.36%

4.2. MNIST Generation 

We trained the full DRAW network as a generative model on
the binarized MNIST dataset (Salakhutdinov & Murray, 2008).
This dataset has been widely studied in the literature,
allowing us to compare the numerical performance (measured in
average nats per image on the test set) of DRAW with existing
methods. Table 2 shows that DRAW without selective attention
performs comparably to other recent generative models such as
DARN, NADE and DBMs, and that DRAW with attention considerably
improves on the state of the art.


Table 2. Negative log-likelihood (in nats) per test-set
example on the binarised MNIST data set. The right hand
column, where present, gives an upper bound (Eq. 12) on the
negative log-likelihood. The previous results are from
[1] (Salakhutdinov & Hinton, 2009), [2] (Murray &
Salakhutdinov, 2009), [3] (Uria et al., 2014), [4] (Raiko
et al., 2014), [5] (Rezende et al., 2014), [6] (Salimans
et al., 2014), [7] (Gregor et al., 2014).

Model                               −log p     ≤
DBM 2hl [1]                         ≈ 84.62
DBN 2hl [2]                         84.55
NADE [3]                            88.33
EoNADE 2hl (128 orderings) [3]      85.10
EoNADE-5 2hl (128 orderings) [4]    84.68
DLGM [5]                            ≈ 86.60
DLGM 8 leapfrog steps [6]           ≈ 85.51    88.30
DARN 1hl [7]                        ≈ 84.13    88.30
DARN 12hl [7]                       -          87.72
DRAW without attention              -          87.40
DRAW                                -          80.97

Figure 6. Generated MNIST images. All digits were generated
by DRAW except those in the rightmost column, which shows the
training set images closest to those in the column second to
the right (pixelwise $L^2$ is the distance measure). Note that
the network was trained on binary samples, while the generated
images are mean probabilities.

Once the DRAW network was trained, we generated MNIST digits
following the method in Section 2.3, examples of which are
presented in Fig. 6. Fig. 7 illustrates the image generation
sequence for a DRAW network without selective attention (see
Section 3.1). It is interesting to compare this with the
generation sequence for DRAW with attention, as depicted in
Fig. 1. Whereas without attention it progressively sharpens a
blurred image in a global way, with attention it constructs
the digit by tracing the lines, much like a person with a pen.

Figure 7. MNIST generation sequences for DRAW without
attention. Notice how the network first generates a very
blurry image that is subsequently refined.

4.3. MNIST Generation with Two Digits 

The main motivation for using an attention-based generative
model is that large images can be built up iteratively, by
adding to a small part of the image at a time. To test this
capability in a controlled fashion, we trained DRAW to
generate images with two 28 × 28 MNIST images chosen at random
and placed at random locations in a 60 × 60 black background.
In cases where the two digits overlap, the pixel intensities
were added together at each point and clipped to be no greater
than one. Examples of generated data are shown in Fig. 8. The
network typically generates one digit and then the other,
suggesting an ability to recreate composite scenes from simple
pieces.
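
The construction of the two-digit training images can be
sketched as follows. This is a minimal illustration under
assumptions the text does not spell out: offsets are drawn
uniformly so that each digit lies fully inside the canvas, and
images are nested lists of intensities in $[0, 1]$. The function
name `compose_two_digits` is hypothetical.

```python
import random


def compose_two_digits(digit_a, digit_b, rng, canvas_size=60):
    """Place two digit images at random offsets on a black canvas.

    Overlapping intensities are summed and clipped to be no greater than one.
    """
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    for digit in (digit_a, digit_b):
        h, w = len(digit), len(digit[0])
        top = rng.randint(0, canvas_size - h)     # inclusive bounds keep the digit inside
        left = rng.randint(0, canvas_size - w)
        for r in range(h):
            for c in range(w):
                summed = canvas[top + r][left + c] + digit[r][c]
                canvas[top + r][left + c] = min(1.0, summed)
    return canvas
```

Clipping keeps every pixel a valid Bernoulli mean, so the same
cross-entropy reconstruction loss applies to the composite
images.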

4.4. Street View House Number Generation 

MNIST digits are very simplistic in terms of visual structure,
and we were keen to see how well DRAW performed on natural
images. Our first natural image generation experiment used the
multi-digit Street View House Numbers dataset (Netzer et al.,
2011). We used the same preprocessing as (Goodfellow et al.,
2013), yielding a 64 × 64 house number image for each training
example. The network was then trained using 54 × 54 patches
extracted at random locations from the preprocessed images.
The SVHN training set contains 231,053 images, and the
validation set contains 4,701 images.

The house number images generated by the network are highly
realistic, as shown in Figs. 9 and 10. Fig. 11 reveals that,
despite the long training time, the DRAW network underfit the
SVHN training data.

Figure 8. Generated MNIST images with two digits.

Figure 9. Generated SVHN images. The rightmost column shows
the training images closest (in $L^2$ distance) to the
generated images beside them. Note that the two columns are
visually similar, but the numbers are generally different.

4.5. Generating CIFAR Images

The most challenging dataset we applied DRAW to was the
CIFAR-10 collection of natural images (Krizhevsky, 2009).
CIFAR-10 is very diverse, and with only 50,000 training
examples it is very difficult to generate realistic-looking
objects without overfitting (in other words, without copying
from the training set). Nonetheless the images in Fig. 12
demonstrate that DRAW is able to capture much of the shape,
colour and composition of real photographs.

Table 3. Experimental Hyper-Parameters.

Task                              #glimpses  LSTM #h  #z   Read Size  Write Size
100 × 100 MNIST Classification    8          256      -    12 × 12    -
MNIST Model                       64         256      100  2 × 2      5 × 5
SVHN Model                        32         800      100  12 × 12    12 × 12
CIFAR Model                       64         400      200  5 × 5      5 × 5

Figure 10. SVHN Generation Sequences. The red rectangle
indicates the attention patch. Notice how the network draws
the digits one at a time, and how it moves and scales the
writing patch to produce numbers with different slopes and
sizes.

Figure 11. Training and validation cost on SVHN. The
validation cost is consistently lower because the validation
set patches were extracted from the image centre (rather than
from random locations, as in the training set). The network
was never able to overfit on the training data.

Figure 12. Generated CIFAR images. The rightmost column
shows the nearest training examples to the column beside it.

5. Conclusion

This paper introduced the Deep Recurrent Attentive Writer
(DRAW) neural network architecture, and demonstrated its
ability to generate highly realistic natural images such as
photographs of house numbers, as well as improving on the
best known results for binarized MNIST generation. We also
established that the two-dimensional differentiable attention
mechanism embedded in DRAW is beneficial not only to image
generation, but also to image classification.